options(encoding = 'UTF-8')
#Loading all the necessary packages
if (!require("CASdatasets")) install.packages("CASdatasets", repos = "http://cas.uqam.ca/pub/R/", type="source")
if (!require("caret")) install.packages("caret")
if (!require("ggplot2")) install.packages("ggplot2")
if (!require("mgcv")) install.packages("mgcv")
if (!require("plyr")) install.packages("plyr")
if (!require("gridExtra")) install.packages("gridExtra")
if (!require("visreg")) install.packages("visreg")
if (!require("MASS")) install.packages("MASS")
if (!require("plotrix")) install.packages("plotrix")
if (!require("rgeos")) install.packages("rgeos", type="source")
if (!require("rgdal")) install.packages("rgdal", type="source")
if (!require("xtable")) install.packages("xtable")
if (!require("formatR")) install.packages("formatR")
if (!require("maptools")) install.packages("maptools")
require("CASdatasets")
require("ggplot2")
require("mgcv")
require("caret")
require("gridExtra")
require("plyr")
require("visreg")
require("MASS")
require("plotrix")
require("rgdal")
require("rgeos")
require("xtable")
require("formatR")
require("maptools")
At the beginning of every data analysis some data manipulations and visualizations are required. The core for this is the tidyverse universe, which is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
Learn the tidyverse: See how the tidyverse makes data science faster, easier and more fun with “R for Data Science”. Read it online (https://r4ds.had.co.nz/), buy the book or try another resource from the community.
For visualizations, tidyverse contains ggplot2, see https://ggplot2.tidyverse.org/, and the corresponding book at https://github.com/hadley/ggplot2-book.
## 1) If 'CASdatasets' package can be loaded, then as follows:
data("freMTPLfreq")
freMTPLfreq = subset(freMTPLfreq, Exposure <= 1 & Exposure >= 0 & CarAge <=
25)
# Subsample of the whole dataset
set.seed(85)
folds = createDataPartition(freMTPLfreq$ClaimNb, 0.5)
df_freqMTPLfreq = freMTPLfreq[folds[[1]], ]
# save(df_freqMTPLfreq, file='../dataset.RData')
## 3) If not, then download the file 'freMTPLfreq.RData' from the GitHub
## repository and run the following: data_path <-
## 'C:/Users/juerg/AktuarDataScience/Varia/2019_SSA/'
## load(file=paste(data_path,'df_freMTPLfreq.RData',sep=''))
dataset <- df_freqMTPLfreq
A good idea is to check whether the dataset has been loaded correctly. To do this, the following tools can be used:
head(dataset)
## PolicyID ClaimNb Exposure Power CarAge DriverAge
## 1 1 0 0.09 g 0 46
## 2 2 0 0.84 g 0 46
## 4 4 0 0.45 f 2 38
## 5 5 0 0.15 g 0 41
## 7 7 0 0.81 d 1 27
## 9 9 0 0.76 d 9 23
## Brand Gas Region Density
## 1 Japanese (except Nissan) or Korean Diesel R72 76
## 2 Japanese (except Nissan) or Korean Diesel R72 76
## 4 Japanese (except Nissan) or Korean Regular R31 3003
## 5 Japanese (except Nissan) or Korean Diesel R52 60
## 7 Japanese (except Nissan) or Korean Regular R72 695
## 9 Fiat Regular R31 7887
str(dataset)
## 'data.frame': 205432 obs. of 10 variables:
## $ PolicyID : int 1 2 4 5 7 9 10 14 17 18 ...
## $ ClaimNb : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Exposure : num 0.09 0.84 0.45 0.15 0.81 0.76 0.34 0.19 0.8 0.07 ...
## $ Power : Factor w/ 12 levels "d","e","f","g",..: 4 4 3 4 1 1 6 2 2 2 ...
## $ CarAge : int 0 0 2 0 1 9 0 0 0 0 ...
## $ DriverAge: int 46 46 38 41 27 23 44 33 69 69 ...
## $ Brand : Factor w/ 7 levels "Fiat","Japanese (except Nissan) or Korean",..: 2 2 2 2 2 1 2 2 2 2 ...
## $ Gas : Factor w/ 2 levels "Diesel","Regular": 1 1 2 1 2 2 2 2 2 2 ...
## $ Region : Factor w/ 10 levels "R11","R23","R24",..: 9 9 5 6 9 5 1 1 1 1 ...
## $ Density : int 76 76 3003 60 695 7887 27000 1746 1376 1376 ...
summary(dataset)
## PolicyID ClaimNb Exposure Power
## Min. : 1 Min. :0.00000 Min. :0.002732 f :47709
## 1st Qu.:102981 1st Qu.:0.00000 1st Qu.:0.200000 g :45390
## Median :206768 Median :0.00000 Median :0.530000 e :38329
## Mean :206581 Mean :0.03914 Mean :0.559871 d :33720
## 3rd Qu.:309942 3rd Qu.:0.00000 3rd Qu.:1.000000 h :13346
## Max. :413168 Max. :4.00000 Max. :1.000000 j : 8962
## (Other):17976
## CarAge DriverAge
## Min. : 0.000 Min. :18.00
## 1st Qu.: 3.000 1st Qu.:34.00
## Median : 7.000 Median :44.00
## Mean : 7.417 Mean :45.29
## 3rd Qu.:12.000 3rd Qu.:54.00
## Max. :25.000 Max. :99.00
##
## Brand Gas
## Fiat : 8341 Diesel :102771
## Japanese (except Nissan) or Korean: 39413 Regular:102661
## Mercedes, Chrysler or BMW : 9651
## Opel, General Motors or Ford : 18638
## other : 4947
## Renault, Nissan or Citroen :108267
## Volkswagen, Audi, Skoda or Seat : 16175
## Region Density
## R24 :79767 Min. : 2
## R11 :34778 1st Qu.: 67
## R53 :21032 Median : 287
## R52 :19321 Mean : 1982
## R72 :15506 3rd Qu.: 1408
## R31 :13576 Max. :27000
## (Other):21452
If one needs some help on a function, typing a question mark and the name of the function in the console opens the help file of the function. For instance,
?head
We will now have a descriptive analysis of the portfolio. The different variables available are PolicyID, ClaimNb, Exposure, Power, CarAge, DriverAge, Brand, Gas, Region, Density.
The variable PolicyID related to a unique identifier of the policy. We can check that every policy appears only once in the dataset
length(unique(dataset$PolicyID)) == nrow(dataset)
## [1] TRUE
The Exposure reveals the fraction of the year during which the policyholder is in the portfolio. We can compute the total exposure by summing the policyholders’ exposures. Here we find 115015.5 years.
We can show the number of months of exposure on a table.
table(cut(dataset$Exposure, breaks = seq(from = 0, to = 1, by = 1/12), labels = 1:12))
##
## 1 2 3 4 5 6 7 8 9 10 11 12
## 31393 14729 16610 12040 9793 14680 9447 7248 10714 6838 6066 65874
Using the function prop.table, it is possible to represent this information in relative terms show the number of months of exposure on a table.
round(prop.table(table(cut(dataset$Exposure, breaks = seq(from = 0, to = 1,
by = 1/12), labels = 1:12))), 4)
##
## 1 2 3 4 5 6 7 8 9 10
## 0.1528 0.0717 0.0809 0.0586 0.0477 0.0715 0.0460 0.0353 0.0522 0.0333
## 11 12
## 0.0295 0.3207
Alternatively, we can use a barplot !
Exposure.summary = cut(dataset$Exposure, breaks = seq(from = 0, to = 1,by = 1/12))
levels(Exposure.summary) = 1:12
ggplot()+geom_bar(aes(x=Exposure.summary)) + xlab("Number of months") + ggtitle("Exposure in months")
ggplot(dataset, aes(x=ClaimNb))+geom_bar()+
geom_text(stat='count', aes(label=..count..), vjust=-1)+ylim(c(0,210000))+
ylab("")+ xlab("Number of Claims")+ ggtitle("Proportion of policies by number of claims")
We can compute the average claim frequency in this portfolio, taking into account the different exposures.
sum(dataset$ClaimNb)/sum(dataset$Exposure)
Here, we obtain 0.0699.
Let us now look at the other variables.
The variable Power is a categorized variable, related to the power of the car. The levels of the variable are ordered categorically. We can see the different levels of a factor by using the function level in R:
levels(dataset$Power)
## [1] "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o"
We can see the number of observations in each level of the variable, by using the function table.
table(dataset$Power)
##
## d e f g h i j k l m n o
## 33720 38329 47709 45390 13346 8793 8962 4659 2255 905 624 740
Remember however, that in insurance, exposures may differ from one policyholder to another. Hence, the table above, does NOT measure the exposure in each level of the variable Power. We can use the function ddply to give us the exposure in each level of the variable.
require(plyr)
Power.summary = ddply(dataset, .(Power), summarize, totalExposure = sum(Exposure),
Number.Observations = length(Exposure))
We can show this on a plot as well:
ggplot(Power.summary, aes(x = Power, y = totalExposure, fill = Power)) + geom_bar(stat = "identity") +
ylab("Exposure in years") + geom_text(stat = "identity", aes(label = round(totalExposure,
0), color = Power), vjust = -0.5) + guides(fill = FALSE, color = FALSE)
Let us now look at the observed claim frequency in each level
Power.summary = ddply(dataset, .(Power), summarize, totalExposure = sum(Exposure),
Number.Observations = length(Exposure), Number.Claims = sum(ClaimNb), Obs.Claim.Frequency = sum(ClaimNb)/sum(Exposure))
Power.summary
| Power | totalExposure | Number.Observations | Number.Claims | Obs.Claim.Frequency |
|---|---|---|---|---|
| d | 18711.53 | 33720 | 1148 | 0.06135 |
| e | 22157.35 | 38329 | 1631 | 0.07361 |
| f | 27823.54 | 47709 | 1998 | 0.07181 |
| g | 25678.47 | 45390 | 1672 | 0.06511 |
| h | 6956.80 | 13346 | 505 | 0.07259 |
| i | 4680.11 | 8793 | 369 | 0.07884 |
| j | 4587.50 | 8962 | 357 | 0.07782 |
| k | 2247.44 | 4659 | 188 | 0.08365 |
| l | 1046.09 | 2255 | 78 | 0.07456 |
| m | 475.74 | 905 | 37 | 0.07777 |
| n | 317.64 | 624 | 31 | 0.09760 |
| o | 333.32 | 740 | 26 | 0.07800 |
We can compute the ratio to the portfolio claim frequency.
portfolio.cf = sum(dataset$ClaimNb)/sum(dataset$Exposure)
ggplot(Power.summary) + geom_bar(stat = "identity", aes(x = Power, y = Obs.Claim.Frequency,
fill = Power)) + geom_line(aes(x = as.numeric(Power), y = portfolio.cf),
color = "red") + guides(fill = FALSE)
The vehicle age, in years. This is the first continuous variable that we encounter (although it only takes discrete values).
ggplot(dataset, aes(x=CarAge)) + geom_bar() + xlab("Age of the Car")
Again, here, the exposures are not considered on the histogram. We can use ddply to correct this.
CarAge.summary = ddply(dataset, .(CarAge), summarize, totalExposure = sum(Exposure),
Number.Observations = length(Exposure))
CarAge.summary
| CarAge | totalExposure | Number.Observations |
|---|---|---|
| 0 | 4346.42 | 14947 |
| 1 | 9016.91 | 18863 |
| 2 | 8621.68 | 16246 |
| 3 | 7902.55 | 14337 |
| 4 | 7438.18 | 12851 |
| 5 | 7206.40 | 11889 |
| 6 | 6920.09 | 11245 |
| 7 | 6538.41 | 10516 |
| 8 | 6636.01 | 10624 |
| 9 | 6319.48 | 10268 |
| 10 | 6931.74 | 12168 |
| 11 | 6051.96 | 9727 |
| 12 | 5819.12 | 9544 |
| 13 | 5622.09 | 9169 |
| 14 | 4938.09 | 8244 |
| 15 | 4349.96 | 7674 |
| 16 | 3099.81 | 5156 |
| 17 | 2375.68 | 3950 |
| 18 | 1713.78 | 2822 |
| 19 | 1175.23 | 1931 |
| 20 | 732.41 | 1231 |
| 21 | 470.92 | 781 |
| 22 | 316.97 | 505 |
| 23 | 207.92 | 328 |
| 24 | 145.56 | 219 |
| 25 | 118.17 | 197 |
Then, we can plot the data onto a barplot, as before.
ggplot(CarAge.summary, aes(x = CarAge, y = totalExposure)) + geom_bar(stat = "identity") +
ylab("Exposure in years")
We can see a large difference, specially for new cars, which makes sense ! Indeed, let us look at the Exposure for new vehicles, using a boxplot for instance.
ggplot(dataset[dataset$CarAge == 0, ], aes(x = "Exposure", y = Exposure)) +
geom_boxplot() + ggtitle("Exposure of new cars")
Let us now also compute the claim frequency by age of car,
CarAge.summary = ddply(dataset, .(CarAge), summarize, totalExposure = sum(Exposure),
Number.Observations = length(Exposure), Number.Claims = sum(ClaimNb), Obs.Claim.Freq = sum(ClaimNb)/sum(Exposure))
CarAge.summary
| CarAge | totalExposure | Number.Observations | Number.Claims | Obs.Claim.Freq |
|---|---|---|---|---|
| 0 | 4346.42 | 14947 | 285 | 0.06557 |
| 1 | 9016.91 | 18863 | 653 | 0.07242 |
| 2 | 8621.68 | 16246 | 591 | 0.06855 |
| 3 | 7902.55 | 14337 | 551 | 0.06972 |
| 4 | 7438.18 | 12851 | 528 | 0.07099 |
| 5 | 7206.40 | 11889 | 501 | 0.06952 |
| 6 | 6920.09 | 11245 | 509 | 0.07355 |
| 7 | 6538.41 | 10516 | 502 | 0.07678 |
| 8 | 6636.01 | 10624 | 479 | 0.07218 |
| 9 | 6319.48 | 10268 | 487 | 0.07706 |
| 10 | 6931.74 | 12168 | 527 | 0.07603 |
| 11 | 6051.96 | 9727 | 441 | 0.07287 |
| 12 | 5819.12 | 9544 | 449 | 0.07716 |
| 13 | 5622.09 | 9169 | 369 | 0.06563 |
| 14 | 4938.09 | 8244 | 330 | 0.06683 |
| 15 | 4349.96 | 7674 | 273 | 0.06276 |
| 16 | 3099.81 | 5156 | 197 | 0.06355 |
| 17 | 2375.68 | 3950 | 140 | 0.05893 |
| 18 | 1713.78 | 2822 | 88 | 0.05135 |
| 19 | 1175.23 | 1931 | 48 | 0.04084 |
| 20 | 732.41 | 1231 | 39 | 0.05325 |
| 21 | 470.92 | 781 | 20 | 0.04247 |
| 22 | 316.97 | 505 | 11 | 0.03470 |
| 23 | 207.92 | 328 | 6 | 0.02886 |
| 24 | 145.56 | 219 | 9 | 0.06183 |
| 25 | 118.17 | 197 | 7 | 0.05924 |
and plot it!
ggplot(CarAge.summary, aes(x = CarAge, y = Obs.Claim.Freq)) + geom_point() +
ylab("Observed Claim Frequency") + xlab("Age of the Car") + ylim(c(0, 0.08))
Similarly to the Age of the Car, we can visualize the Age of the Drivers.
DriverAge.summary = ddply(dataset, .(DriverAge), summarize, totalExposure = sum(Exposure), Number.Observations = length(Exposure), Number.Claims = sum(ClaimNb), Obs.Claim.Freq = sum(ClaimNb)/sum(Exposure))
head(DriverAge.summary,9)
| DriverAge | totalExposure | Number.Observations | Number.Claims | Obs.Claim.Freq |
|---|---|---|---|---|
| 18 | 78.37 | 257 | 23 | 0.29347 |
| 19 | 314.15 | 811 | 90 | 0.28648 |
| 20 | 492.48 | 1213 | 112 | 0.22742 |
| 21 | 637.28 | 1528 | 105 | 0.16476 |
| 22 | 794.64 | 1822 | 122 | 0.15353 |
| 23 | 923.73 | 2074 | 124 | 0.13424 |
| 24 | 1073.46 | 2434 | 111 | 0.10340 |
| 25 | 1234.65 | 2765 | 114 | 0.09233 |
| 26 | 1476.38 | 3240 | 157 | 0.10634 |
We can show the Exposures by Age of the Driver
ggplot(DriverAge.summary, aes(x = DriverAge, y = totalExposure)) + geom_bar(stat = "identity",
width = 0.8) + ylab("Exposure in years") + xlab("Age of the Driver")
and the observed claim frequency by Age.
ggplot(DriverAge.summary, aes(x = DriverAge, y = Obs.Claim.Freq)) + geom_point() +
ylab("Observed Claim Frequency") + xlab("Age of the Driver")
The variable Brand is a categorized variable, related to the brand of the car. We can see the different levels of a factor by using the function level in R:
levels(dataset$Brand)
## [1] "Fiat"
## [2] "Japanese (except Nissan) or Korean"
## [3] "Mercedes, Chrysler or BMW"
## [4] "Opel, General Motors or Ford"
## [5] "other"
## [6] "Renault, Nissan or Citroen"
## [7] "Volkswagen, Audi, Skoda or Seat"
Brand.summary = ddply(dataset, .(Brand), summarize, totalExposure = sum(Exposure),
Number.Observations = length(Exposure), Number.Claims = sum(ClaimNb), Obs.Claim.Freq = sum(ClaimNb)/sum(Exposure))
Brand.summary
| Brand | totalExposure | Number.Observations | Number.Claims | Obs.Claim.Freq |
|---|---|---|---|---|
| Fiat | 4728.26 | 8341 | 357 | 0.07550 |
| Japanese (except Nissan) or Korean | 15499.62 | 39413 | 993 | 0.06407 |
| Mercedes, Chrysler or BMW | 5254.51 | 9651 | 416 | 0.07917 |
| Opel, General Motors or Ford | 10812.52 | 18638 | 847 | 0.07834 |
| other | 2905.09 | 4947 | 232 | 0.07986 |
| Renault, Nissan or Citroen | 66756.93 | 108267 | 4446 | 0.06660 |
| Volkswagen, Audi, Skoda or Seat | 9058.60 | 16175 | 749 | 0.08268 |
require(ggplot2)
ggplot(Brand.summary, aes(x = reorder(Brand, totalExposure), y = totalExposure,
fill = Brand)) + geom_bar(stat = "identity") + coord_flip() + guides(fill = FALSE) +
xlab("") + ylab("Exposure in years")
Let us now look at the claim frequency by Brand of the car.
ggplot(Brand.summary, aes(x = reorder(Brand, Obs.Claim.Freq), y = Obs.Claim.Freq,
fill = Brand)) + geom_bar(stat = "identity") + coord_flip() + guides(fill = FALSE) +
ggtitle("Observed Claim Frequencies by Brand of the car") + xlab("") + ylab("Observed Claim Frequency")
The variable Gas is a categorized variable, related to the fuel of the car. We can see the different levels of a factor by using the function level in R:
levels(dataset$Gas)
## [1] "Diesel" "Regular"
Gas.summary = ddply(dataset, .(Gas), summarize, totalExposure = sum(Exposure),
Number.Observations = length(Exposure), Number.Claims = sum(ClaimNb), Obs.Claim.Freq = sum(ClaimNb)/sum(Exposure))
ggplot(Gas.summary, aes(x = Gas, y = totalExposure, fill = Gas)) + geom_bar(stat = "identity") +
guides(fill = FALSE)
There seems to be a similar amount of Diesel and Regular gas vehicles in the portfolio. It is generally expected that Diesel have a higher claim frequency. Does this also hold on our dataset ?
ggplot(Gas.summary, aes(x = Gas, y = Obs.Claim.Freq, fill = Gas)) + geom_bar(stat = "identity") +
guides(fill = FALSE)
The variable Region is a categorized variable, related to the region of the place of residence. We can see the different levels of a factor by using the function level in R:
levels(dataset$Region)
## [1] "R11" "R23" "R24" "R25" "R31" "R52" "R53" "R54" "R72" "R74"
What are the Exposures in each region ? What are the observed claim frequencies ?
Region.summary = ddply(dataset, .(Region), summarize, totalExposure = sum(Exposure),
Number.Observations = length(Exposure), Number.Claims = sum(ClaimNb), Obs.Claim.Freq = sum(ClaimNb)/sum(Exposure))
Region.summary
| Region | totalExposure | Number.Observations | Number.Claims | Obs.Claim.Freq |
|---|---|---|---|---|
| R11 | 15024.48 | 34778 | 1266 | 0.08426 |
| R23 | 1531.38 | 4275 | 94 | 0.06138 |
| R24 | 50927.07 | 79767 | 3289 | 0.06458 |
| R25 | 3274.99 | 5398 | 227 | 0.06931 |
| R31 | 5676.75 | 13576 | 483 | 0.08508 |
| R52 | 10871.82 | 19321 | 760 | 0.06991 |
| R53 | 13889.48 | 21032 | 928 | 0.06681 |
| R54 | 5573.92 | 9455 | 386 | 0.06925 |
| R72 | 7033.85 | 15506 | 516 | 0.07336 |
| R74 | 1211.77 | 2324 | 91 | 0.07510 |
Using the function twoord.plot we can easily show both the Exposures and the observed claim frequencies on the same plot.
twoord.plot(1:10, Region.summary$totalExposure, 1:10, Region.summary$Obs.Claim.Freq,
xlab = "Region", rylim = c(0, 0.1), type = c("bar", "p"), xticklab = Region.summary$Region,
ylab = "Exposure", rylab = "Observed Claim Frequency")
We can plot a map with the observed claim frequencies
# Download shapefile from http://www.diva-gis.org/gData Extract all the
# files from the zip files, in a directory called shapefiles in your working
# directory
area <- rgdal::readOGR("C:/Users/juerg/AktuarDataScience/Varia/2019_SSA/shapefiles/FRA_adm2.shp") # From http://www.diva-gis.org/gData
## OGR data source with driver: ESRI Shapefile
## Source: "C:\Users\juerg\AktuarDataScience\Varia\2019_SSA\shapefiles\FRA_adm2.shp", layer: "FRA_adm2"
## with 96 features
## It has 11 fields
## Integer64 fields read as strings: ID_0 ID_1 ID_2
Region.summary$id = sapply(Region.summary$Region, substr, 2, 3)
area.points = fortify(area, region = "ID_2") #Convert to data.frame
area.points = merge(area.points, Region.summary[, c("id", "totalExposure", "Obs.Claim.Freq")],
by.x = "id", by.y = "id", all.x = TRUE)
area.points = area.points[order(area.points$order), ] #Has to be ordered correctly to plot.
ggplot(area.points, aes(long, lat, group = group)) + ggtitle("Observed Claim Frequencies") +
geom_polygon(aes(fill = area.points$Obs.Claim.Freq)) + scale_fill_gradient(low = "green",
high = "red", name = "Obs. Claim Freq.", limits = c(0.061, 0.085)) + xlab("Longitude") +
ylab("Latitude")
and the exposures (on a log-scale)…
ggplot(area.points, aes(long, lat, group = group)) + ggtitle("log Exposures in years") +
geom_polygon(aes(fill = log(area.points$totalExposure))) + scale_fill_gradient(low = "blue",
high = "red", name = "log Exposure") + xlab("Longitude") + ylab("Latitude")
The Density represents here the density of the population at the place of residence. Let us take a look at the densities in the dataset.
summary(dataset$Density)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2 67 287 1982 1408 27000
ggplot(dataset, aes(Density)) + geom_histogram(bins = 200)
Here, contrary to the age of the driver, or the age of the car, the density has lots of different values
length(unique(dataset$Density))
We can compute this by using the command above, and we get 1267.
Density.summary = ddply(dataset, .(Density), summarize, totalExposure = sum(Exposure),
Number.Observations = length(Exposure), Number.Claims = sum(ClaimNb), Obs.Claim.Freq = sum(ClaimNb)/sum(Exposure))
head(Density.summary)
| Density | totalExposure | Number.Observations | Number.Claims | Obs.Claim.Freq |
|---|---|---|---|---|
| 2 | 9.07 | 13 | 1 | 0.11025 |
| 3 | 58.94 | 81 | 2 | 0.03393 |
| 4 | 18.22 | 29 | 1 | 0.05490 |
| 5 | 67.99 | 122 | 7 | 0.10295 |
| 6 | 131.76 | 207 | 8 | 0.06072 |
| 7 | 182.77 | 288 | 7 | 0.03830 |
We can plot the observed claim frequencies…
ggplot(Density.summary, aes(x = Density, y = Obs.Claim.Freq)) + geom_point()
… but realize it is impossible to see a trend. One way out is to categorize the variable. We will see later (GAM) that it is possible to estimate a smooth function, which avoid the arbitrary categorization.
We can categorize the variable using the function cut.
dataset$DensityCAT = cut(dataset$Density, breaks = quantile(dataset$Density,
probs = seq(from = 0, to = 1, by = 0.1)), include.lowest = TRUE)
table(dataset$DensityCAT)
##
## [2,28] (28,51] (51,91]
## 20813 20571 20805
## (91,158] (158,287] (287,554]
## 19986 20584 20508
## (554,1.16e+03] (1.16e+03,2.4e+03] (2.4e+03,4.35e+03]
## 20571 20541 21290
## (4.35e+03,2.7e+04]
## 19763
levels(dataset$DensityCAT) <- LETTERS[1:10]
Then, we can apply the same strategy as above.
Density.summary = ddply(dataset, .(DensityCAT), summarize, totalExposure = sum(Exposure),
Number.Observations = length(Exposure), Number.Claims = sum(ClaimNb), Obs.Claim.Freq = sum(ClaimNb)/sum(Exposure))
Density.summary
| DensityCAT | totalExposure | Number.Observations | Number.Claims | Obs.Claim.Freq |
|---|---|---|---|---|
| A | 13499.50 | 20813 | 678 | 0.05022 |
| B | 12959.15 | 20571 | 738 | 0.05695 |
| C | 12695.77 | 20805 | 769 | 0.06057 |
| D | 12001.69 | 19986 | 754 | 0.06282 |
| E | 12166.47 | 20584 | 798 | 0.06559 |
| F | 11918.67 | 20508 | 845 | 0.07090 |
| G | 11065.29 | 20571 | 904 | 0.08170 |
| H | 10670.17 | 20541 | 897 | 0.08407 |
| I | 9707.02 | 21290 | 922 | 0.09498 |
| J | 8331.80 | 19763 | 735 | 0.08822 |
Using the function twoord.plot we can easily show both the Exposures and the observed claim frequencies on the same plot.
twoord.plot(1:10, Density.summary$totalExposure, 1:10, Density.summary$Obs.Claim.Freq,
xlab = "Density (categorized)", lylim = c(0, 15000), rylim = c(0, 0.15),
type = c("bar", "p"), xticklab = Density.summary$Density, ylab = "Exposure",
rylab = "Observed Claim Frequency", lytickpos = seq(0, 15000, 5000), rytickpos = seq(0,
0.15, 0.03))